Deep Dive LS8 labels for training set

Purpose


This script takes a deep dive into Landsat 8 labels for a more rigorous analysis of inconsistent band data and outliers in the filtered label dataset. Here we determine whether any more label data points should be removed from the training dataset, and whether the metadata in the outlier dataset can help us pre-emptively toss out scenes when we go to apply the classification algorithm.

harmonize_version = "2024-04-25"
outlier_version = "2024-04-25"

LS8 <- read_rds(paste0("data/labels/harmonized_LS89_labels_", harmonize_version, ".RDS")) %>% 
  filter(mission == "LANDSAT_8")

Check for mis-matched band data between user data and re-pull

Look at the data to check for consistency between the user-pulled band data and our re-pull; here, our user data are in “BX” format and the re-pull is in “SR_BX” format. These steps assure data quality in case a volunteer didn’t follow the directions exactly.

pmap(.l = list(user_band = LS89_user,
               ee_band = LS89_ee,
               data = list(LS8),
               mission = list("LANDSAT_8")),
     .f = make_band_comp_plot)
(six user-vs-re-pull band comparison plots, one per band)
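`make_band_comp_plot()` is defined elsewhere in this repository; a minimal, hypothetical sketch of the comparison it draws (assuming `ggplot2` and the band-name vectors used above) might look like:

```r
library(ggplot2)

# hypothetical sketch of a user-vs-re-pull band comparison plot;
# the real make_band_comp_plot() lives elsewhere in this repo
make_band_comp_plot_sketch <- function(user_band, ee_band, data, mission) {
  ggplot(data, aes(x = .data[[user_band]], y = .data[[ee_band]])) +
    geom_point(alpha = 0.3) +
    # 1:1 line - points off this line indicate mismatched pulls
    geom_abline(slope = 1, intercept = 0, color = "red") +
    labs(title = paste(mission, user_band, "vs", ee_band),
         x = paste("user-pulled", user_band),
         y = paste("re-pulled", ee_band)) +
    theme_bw()
}
```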

There isn’t a ton of mismatch here. We’ll flag labels where the re-pull is missing (using SR_B7 as a reference) or where any user band disagrees with the re-pull:

LS8_inconsistent <- LS8 %>% 
  filter(is.na(SR_B7) | B2 != SR_B2 | B3 != SR_B3 | 
           B4 != SR_B4 | B5 != SR_B5 | B6 != SR_B6 | B7 != SR_B7)

LS8_inconsistent %>% 
  group_by(class) %>% 
  summarise(n_labels = n()) %>% 
  kable()
class n_labels
algalBloom 1
cloud 29
darkNearShoreSediment 7
lightNearShoreSediment 11
offShoreSediment 12
openWater 5
other 1
shorelineContamination 23

None of these are inconsistent because of saturated pixels:

LS8 %>% 
  filter(is.na(SR_B7)) %>% 
  nrow()
## [1] 0

This leaves 7.7% of the Landsat 8 labels as inconsistent - we’ll dig more into this in a bit. Let’s do a quick sanity check to make sure that we’ve dropped values that are inconsistent between pulls:

LS8_filtered <- LS8 %>% 
  filter(# filter data where the repull data and user data match
         (B2 == SR_B2 & B3 == SR_B3 & 
           B4 == SR_B4 & B5 == SR_B5 & B6 == SR_B6 & B7 == SR_B7),
         # and keep only labels where every re-pulled band value is <= 1
         # (values > 1 aren't valid surface reflectance)
         if_all(LS89_ee,
                ~ . <= 1))
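As a spot check, the 7.7% inconsistency rate mentioned above can be computed directly from the row counts:

```r
# percent of LS8 labels flagged as inconsistent between pulls
round(nrow(LS8_inconsistent) / nrow(LS8) * 100, 1)
```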

And plot:

(six band comparison plots of the filtered data, one per band)

And now let’s look at the data by class:

(six plots of the filtered band data by class, one per band)

We aren’t actually modeling “other” (not sufficient observations to classify) or “shorelineContamination” (we’ll use this later to block areas where there is likely shoreline contamination in the AOI). Additionally, the “algalBloom” labels don’t have sufficient n (nor do we think these are necessarily visible), so let’s drop those categories and look at the data again. Since we’ve completed the process of comparison between the user pull and the re-pull, let’s remove the user-pulled band data, too.

LS8_for_class_analysis <- LS8_filtered %>% 
  filter(!(class %in% c("other", "shorelineContamination", "algalBloom"))) %>% 
  select(-c(B2:B7))
(six plots of band data by class, excluding the dropped categories)

Interesting - the classes look really similar in distribution (maybe because cloud categories are so high). It will be interesting to see if there are statistical differences.

Check for systemic volunteer inconsistencies

Let’s also go back and check to see if there is any pattern to the inconsistent labels.

vol_init n_tot_labs n_dates
MRB 64 3
FYC 8 2
HAD 6 2
LRCP 7 2
AMP 2 1
SKS 2 1

Let’s look at these by image date:

L8_tot <- LS8 %>% 
  group_by(vol_init, date) %>% 
  summarise(n_tot_labels = n())
LS8_inconsistent %>% 
  group_by(vol_init, date) %>% 
  summarise(n_incon_labs = n()) %>% 
  arrange(-n_incon_labs) %>%
  left_join(L8_tot) %>% 
  mutate(perc_incon_labels = round((n_incon_labs/n_tot_labels)*100, 1)) %>% 
  arrange(-perc_incon_labels) %>% 
  kable()
vol_init date n_incon_labs n_tot_labels perc_incon_labels
MRB 2022-08-01 55 117 47.0
FYC 2014-05-23 5 82 6.1
FYC 2017-09-04 3 50 6.0
MRB 2014-06-24 1 17 5.9
SKS 2021-06-27 2 43 4.7
MRB 2020-07-10 8 177 4.5
HAD 2020-08-11 5 136 3.7
LRCP 2014-06-08 6 212 2.8
LRCP 2015-05-10 1 42 2.4
AMP 2016-09-01 2 92 2.2
HAD 2022-11-21 1 55 1.8

Let’s look to see if the scenes for 2022-08-01 have been reprocessed since the user export date.

read_csv("data/labels/collated_scene_metadata_LS8_LS9_v2024-04-25.csv") %>% 
  filter(DATE_ACQUIRED == "2022-08-01") %>% 
  pluck("LANDSAT_PRODUCT_ID")
## [1] "LC08_L2SP_026027_20220801_20220805_02_T1"
## [2] "LC08_L2SP_026028_20220801_20220805_02_T1"

Nope (the second 8-digit number is the date the scene was processed to L2). It’s possible that there was a scene-level issue that we’ll find later, or that the user didn’t follow directions. Either way, this scene will be dropped at the end, but we’ll keep it for the time being to see if it’s problematic anywhere else.

Outlier handling

There are statistical outliers within this dataset, and they may impact the interpretation of any statistical testing we do. Let’s see if we can narrow down where those outliers occur and/or glean anything from the outlier data that may be applicable to the application of the algorithm. Outliers may be a systemic issue (the scene itself is an outlier), a user issue (a volunteer may have been a bad actor), or they may just be real. This section asks those questions. The “true outliers” that we dismiss from the dataset will also be used to help aid in interpretation/application of the algorithm across the Landsat stack, so it is important to note any patterns we see in the outlier dataset.

## [1] "Classes represented in outliers:"
## [1] "darkNearShoreSediment" "offShoreSediment"      "openWater"

Okay, 20 outliers (>1.5*IQR) out of 934 - and they are all from non-cloud groups, and none of them are light near shore sediment.
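The `outliers` object used below isn’t constructed in this excerpt; a sketch of how a 1.5×IQR flag could be built per band on the non-cloud labels (hypothetical helper - the repo’s actual outlier logic may differ, e.g. it may group by class):

```r
# flag values falling outside 1.5*IQR of the band's interquartile range
flag_iqr_outliers <- function(x) {
  qs <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[2] - qs[1]
  x < qs[1] - 1.5 * iqr | x > qs[2] + 1.5 * iqr
}

# labels with at least one outlying band value (sketch)
outliers_sketch <- LS8_for_class_analysis %>% 
  filter(class != "cloud") %>% 
  mutate(across(all_of(LS89_ee), flag_iqr_outliers, .names = "out_{.col}")) %>% 
  filter(if_any(starts_with("out_")))
```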

Scene Level Metadata

How many of these outliers are in specific scenes?

LS8_out_date <- outliers %>% 
  group_by(date, vol_init) %>% 
  summarize(n_out = n())
LS8_date <- LS8_for_class_analysis %>% 
  filter(class != "cloud") %>% 
  group_by(date, vol_init) %>% 
  summarise(n_tot = n())
LS8_out_date <- left_join(LS8_out_date, LS8_date) %>% 
  mutate(percent_outlier = n_out/n_tot*100) %>% 
  arrange(-percent_outlier)
LS8_out_date %>% 
  kable()
date vol_init n_out n_tot percent_outlier
2022-11-21 HAD 10 12 83.3333333
2022-08-01 MRB 5 16 31.2500000
2020-04-05 AMP 1 9 11.1111111
2014-06-08 LRCP 2 189 1.0582011
2020-07-10 MRB 1 97 1.0309278
2020-08-11 HAD 1 131 0.7633588

There are two scenes here with very high outlier percentages; one is that August 2022 scene. Perhaps there is something about the atmospheric correction (AC) in these particular scenes, or the general scene quality?

LS8_out_date %>% 
  filter(percent_outlier > 20) %>% 
  select(date, vol_init) %>% 
  left_join(., LS8) %>% 
  select(date, vol_init, DATA_SOURCE_AIR_TEMPERATURE:max_cloud_cover) %>% 
  distinct() %>% 
  kable()
date vol_init DATA_SOURCE_AIR_TEMPERATURE DATA_SOURCE_ELEVATION DATA_SOURCE_OZONE DATA_SOURCE_PRESSURE DATA_SOURCE_REANALYSIS DATA_SOURCE_WATER_VAPOR NADIR_OFFNADIR CLOUD_COVER_list IMAGE_QUALITY_OLI_list IMAGE_QUALITY_TIRS_list mean_cloud_cover max_cloud_cover
2022-11-21 HAD MODIS GLS2000 MODIS Calculated GEOS-5 FP-IT MODIS NADIR 44.68 9 9 44.680 44.68
2022-08-01 MRB MODIS GLS2000 MODIS Calculated GEOS-5 FP-IT MODIS NADIR 49.6; 50.79 9 9 50.195 50.79

Image quality is high across the board, and the cloud cover is generally moderate.

Let’s look at these two images:

Another case of high cloud cover and adjacent snow! We should definitely toss this scene, but we’ll wait to do that until after we’ve checked whether aerosol is high in the QA band step, since nothing in the metadata would indicate this scene is problematic. For consistency, let’s look at instances where outliers occur in at least three bands for a given label:

date class vol_init user_label_id n_bands_out bands_out
2022-08-01 openWater MRB 1901 4 SR_B2; SR_B3; SR_B4; SR_B5
2022-08-01 openWater MRB 1902 4 SR_B2; SR_B3; SR_B4; SR_B5
2022-08-01 openWater MRB 1903 4 SR_B2; SR_B3; SR_B4; SR_B5
2022-08-01 openWater MRB 1904 4 SR_B2; SR_B3; SR_B4; SR_B5
2022-08-01 openWater MRB 1905 4 SR_B2; SR_B3; SR_B4; SR_B5

Let’s group by image date and volunteer and tally up the number of labels where at least 3 bands were outliers:

date vol_init n_labels
2022-08-01 MRB 5

We are already going to toss this scene, but let’s look at it:

This scene has the weird green cloud situation happening and the south-extending NA data on the west side of the AOI. Let’s look at image quality here:

LS8_for_class_analysis %>% 
  filter(date == "2022-08-01") %>% 
  pluck("IMAGE_QUALITY_OLI_list") %>% 
  unique() %>% 
  unlist()
## [1] "9"

That’s not helpful - the image quality is the highest it can be. The QA pixels might be more helpful here.

Clouds

How many of these outliers have near-pixel clouds (as measured by ST_CDIST)?

There are 20 labels (100% of outliers) that aren’t “cloud” in the outlier dataset that have a cloud distance <500 m, and 558 labels (59.7%) in the whole dataset that have a cloud distance <500 m. Since these proportions are not severely disproportionate, I don’t think this is terribly helpful.
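Those proportions come from a tally along these lines (a sketch; it assumes the `ST_CDIST` column has already been scaled to meters):

```r
# proportion of non-cloud labels within 500 m of a cloud (sketch)
LS8_for_class_analysis %>% 
  filter(class != "cloud") %>% 
  summarise(n_near_cloud = sum(ST_CDIST < 500, na.rm = TRUE),
            perc_near_cloud = round(mean(ST_CDIST < 500, na.rm = TRUE) * 100, 1))
```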

How many of the outliers have high cloud cover, as reported by the scene-level metadata? Note, we don’t have the direct scene cloud cover associated with individual labels, rather a list of the scene level cloud cover values associated with the AOI.

The outlier dataset contains 0 (0%) where the max cloud cover was > 75% and 0 (0%) where the mean cloud cover was > 50%. The filtered dataset contains 0 (0%) where max was >75% and 0 (0%) where the mean cloud cover was > 50%. Welp, this is unhelpful!
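The cloud-cover tallies above are simple threshold counts, sketched here against the `mean_cloud_cover`/`max_cloud_cover` columns shown in the metadata table earlier:

```r
# count outlier labels from scenes with high reported cloud cover (sketch)
outliers %>% 
  summarise(n_max_gt75 = sum(max_cloud_cover > 75, na.rm = TRUE),
            n_mean_gt50 = sum(mean_cloud_cover > 50, na.rm = TRUE))
```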

QA Pixels

Do any of the labels have QA pixel indications of contamination? Let’s see if the medium certainty classification in the QA band is useful here:

LS8_for_class_analysis %>% 
  mutate(QA = case_when(cirrus_conf >= 3 ~ "cirrus",
                        cloud_conf >= 3 ~ "cloud",
                        cloudshad_conf >= 3 ~ "cloud shadow",
                        snowice_conf >= 3 ~ "snow/ice",
                        TRUE ~ "clear")) %>% 
  group_by(QA) %>% 
  filter(class != "cloud") %>% 
  summarize(n_tot = n()) %>% 
  kable()
QA n_tot
clear 532
cloud shadow 26

Low confidence, as with LS5/7, is too conservative (it would flag all labels), so we’ll stick with high confidence, since there is no medium-confidence level in the QA bits for LS8/9.

Let’s look at the cloud shadow group to see if there is anything egregious:

LS8_for_class_analysis %>% 
  filter(cloudshad_conf >= 3, class != "cloud") %>% 
  group_by(date, vol_init) %>% 
  summarise(n_cloud_shadow = n()) %>% 
  arrange(-n_cloud_shadow) %>% 
  left_join(L8_tot) %>% 
  mutate(perc_cloud_shad = round((n_cloud_shadow/n_tot_labels)*100, 1)) %>% 
  kable()
date vol_init n_cloud_shadow n_tot_labels perc_cloud_shad
2022-11-21 HAD 12 55 21.8
2020-04-05 AMP 8 34 23.5
2016-09-01 AMP 3 92 3.3
2020-07-10 MRB 2 177 1.1
2022-08-01 MRB 1 117 0.9

We already know that the highest ranked cloud shadow scene here (2022-11-21) is also one we are going to drop, so I don’t think there is anything else to pursue here.

Aerosol QA bit

Landsat 8 and 9 feature an Aerosol QA band, derived from Band 1. We should look through the data here to see if any of the labels are in high aerosol QA pixels, which the USGS suggests should not be used.

LS8_for_class_analysis %>% 
  filter(aero_level == 3, class != "cloud") %>% 
  group_by(date) %>%
  summarise(n_high_aero_labels = n()) %>% 
  left_join(L8_tot) %>% 
  mutate(perc_high_aero = round(n_high_aero_labels/n_tot_labels*100, 1)) %>% 
  arrange(-perc_high_aero) %>% 
  kable()
date n_high_aero_labels vol_init n_tot_labels perc_high_aero
2017-09-04 24 FYC 50 48.0
2020-08-11 46 HAD 136 33.8
2022-08-01 14 MRB 117 12.0
2022-11-21 5 HAD 55 9.1
2016-09-01 3 AMP 92 3.3
2014-06-08 5 LRCP 212 2.4
2020-07-10 1 MRB 177 0.6

Let’s look at the 2020-08-11 and 2017-09-04 images. First 2020-08-11:

This image is clear as day, but if you zoom in near the Apostle Islands, you can see the haze. As suggested by the USGS, I think it is fine to just toss labels where aerosol is high.

And 2017-09-04:

Woah! I understand now why there might be some algae bloom labels in this dataset. This is very hazy - I’m also interested in the scene quality here:

LS8_for_class_analysis %>% 
  filter(date == "2017-09-04") %>% 
  pluck("IMAGE_QUALITY_OLI_list") %>% 
  unique() %>% 
  unlist()
## [1] "9"

Well that’s surprising. I guess this is truly an instance where we’re going to have to trust the LS8 Aerosol bit and mask out all high aerosol pixels and toss all labels that are flagged with high aerosol. This scene has 23 of 38 labels that are flagged as high aerosol, but I argue anything in this scene should be tossed.

If I map out the aerosol flag in GEE, it looks like this, where white is ‘high aerosol’:

Generally speaking, this is an invalid scene.

Training dataset implications

For the purposes of training data, we are going to throw out the scene that had a majority of inconsistent band values between the user pull and re-pull (2022-08-01) and the high-aerosol scene (2017-09-04), as well as any label where aerosol is high or where any of the QA flags have high confidence for clouds, cloud shadows, or snow/ice.

LS8_training_labels <- LS8_for_class_analysis %>% 
  filter(!(date %in% c("2022-08-01", "2017-09-04"))) %>% 
  filter(class == "cloud" |
           (aero_level < 3 &
              cirrus_conf < 3 &
              cloud_conf < 3 &
              cloudshad_conf < 3 & 
              snowice_conf < 3))

Testing for inter-class differences

We do want to have an idea of how different the classes are, in regards to band data. While there are a bunch of interactions that we could get into here, for the sake of this analysis, we are going to analyze the class differences by band.

Kruskal-Wallis assumptions:

  1. Data are non-Normal or have a skewed distribution
  2. There must be at least two independent groups.
  3. Data have a similar distribution across groups.
  4. Data are independent, the groups shouldn’t have a relationship to one another
  5. Each group should contain at least 5 observations

ANOVA assumptions:

  1. Data are distributed normally
  2. Data are independent
  3. Variance across groups is similar

We can’t entirely assert sample independence, and we know that variance and distribution are different for “cloud” labels, but those data are also visibly different from the other classes.

In order to systematically test for differences between classes and be able to interpret the data, we will need to know some things about our data:

  1. Are the data normally distributed (Shapiro-Wilk)?
  2. Are there outliers that may impact interpretation?
  3. If data are non-normal, perform a Kruskal-Wallis test; otherwise an ANOVA
  4. If the null is rejected (there is a difference in at least one class), perform a post-hoc test for pairwise comparisons (Dunn test in either case)
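That workflow can be sketched per band with base R plus `rstatix::dunn_test()` (a sketch only; the repo’s actual test helpers are defined elsewhere and may handle edge cases differently):

```r
library(rstatix) # for dunn_test()

test_band <- function(data, band) {
  vals <- data[[band]]
  # 1. normality check (shapiro.test() caps at 5000 observations)
  normal <- shapiro.test(sample(vals, min(length(vals), 5000)))$p.value > 0.05
  # 3. omnibus test: ANOVA if normal, Kruskal-Wallis otherwise
  p_omnibus <- if (normal) {
    summary(aov(reformulate("class", band), data = data))[[1]][["Pr(>F)"]][1]
  } else {
    kruskal.test(reformulate("class", band), data = data)$p.value
  }
  # 4. post-hoc pairwise comparison only if the omnibus null is rejected
  if (p_omnibus < 0.05) dunn_test(data, reformulate("class", band)) else NULL
}
```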

With this workflow, most classes are statistically different - below are the cases where the pairwise comparisons were not deemed statistically significant:

## # A tibble: 17 × 9
##    band  group1           group2    n1    n2 statistic      p p.adj p.adj.signif
##    <chr> <chr>            <chr>  <int> <int>     <dbl>  <dbl> <dbl> <chr>       
##  1 SR_B2 darkNearShoreSe… light…    98   142    1.59   0.112  1     ns          
##  2 SR_B3 darkNearShoreSe… light…    98   142    1.99   0.0468 0.468 ns          
##  3 SR_B4 darkNearShoreSe… light…    98   142    0.279  0.780  1     ns          
##  4 SR_B5 darkNearShoreSe… light…    98   142   -1.43   0.153  1     ns          
##  5 SR_B5 offShoreSediment openW…   136    65   -0.0823 0.934  1     ns          
##  6 SR_B6 darkNearShoreSe… light…    98   142   -0.722  0.470  1     ns          
##  7 SR_B6 darkNearShoreSe… offSh…    98   136   -2.39   0.0167 0.167 ns          
##  8 SR_B6 darkNearShoreSe… openW…    98    65   -2.37   0.0179 0.179 ns          
##  9 SR_B6 lightNearShoreS… offSh…   142   136   -1.85   0.0639 0.639 ns          
## 10 SR_B6 lightNearShoreS… openW…   142    65   -1.90   0.0579 0.579 ns          
## 11 SR_B6 offShoreSediment openW…   136    65   -0.409  0.682  1     ns          
## 12 SR_B7 darkNearShoreSe… light…    98   142   -0.154  0.878  1     ns          
## 13 SR_B7 darkNearShoreSe… offSh…    98   136   -1.46   0.145  1     ns          
## 14 SR_B7 darkNearShoreSe… openW…    98    65   -1.87   0.0620 0.620 ns          
## 15 SR_B7 lightNearShoreS… offSh…   142   136   -1.44   0.149  1     ns          
## 16 SR_B7 lightNearShoreS… openW…   142    65   -1.86   0.0631 0.631 ns          
## 17 SR_B7 offShoreSediment openW…   136    65   -0.698  0.485  1     ns

Alright, all over the map here - dark near shore is still a problem, but also some issues with offshore sediment, open water, and light near shore sediment. This could be problematic. We’ll have to see how these data look and hope that ML can pick up on the subtle differences.

Let’s look at the boxplots of the non-cloud classes:

LS8_training_labels_no_clouds <- LS8_training_labels %>% 
  filter(class != "cloud")
pmap(.l = list(data = list(LS8_training_labels_no_clouds),
               data_name = list("LANDSAT_8"),
               band = LS89_ee),
     .f = make_class_comp_plot)
(six class comparison boxplots, one per band)

There is 1 outlier in SR_B2, 0 in SR_B3, 6 in SR_B4, 2 in SR_B5, 1 in SR_B6, and 2 in SR_B7.

DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment

There are definitely some varying patterns here, let’s zoom in on the sediment classes.

DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment

Hmm, this is a true scattershot. It will be interesting to see what happens in development and application.

Aggregating sediment classes and performing statistical tests

As a backup, we should consider using aggregated sediment classes, where any sediment label is treated as a general class of “sediment”. Let’s do the same process here to test for class significance.
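The `agg_class` column used below isn’t created in this excerpt; a sketch of the aggregation (assuming the class names shown earlier):

```r
# collapse the three sediment classes into a single "sediment" class (sketch)
LS8_training_labels <- LS8_training_labels %>% 
  mutate(agg_class = if_else(class %in% c("darkNearShoreSediment",
                                          "lightNearShoreSediment",
                                          "offShoreSediment"),
                             "sediment",
                             class))
```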

## # A tibble: 2 × 9
##   band  group1    group2      n1    n2 statistic      p p.adj p.adj.signif
##   <chr> <chr>     <chr>    <int> <int>     <dbl>  <dbl> <dbl> <chr>       
## 1 SR_B6 openWater sediment    65   376      1.70 0.0892 0.267 ns          
## 2 SR_B7 openWater sediment    65   376      1.65 0.100  0.300 ns

And let’s look at the scatter plots here:

And if we drop the cloud:

LS8_training_labels %>% 
  filter(agg_class != "cloud") %>% 
  ggpairs(., columns = LS89_ee, aes(color = agg_class)) + 
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme_few()

Okay, this may be a reasonable backup, even if it’s kind of boring…

Export the training labels

Things to note for Landsat 8:

  • bright cloud cover and snow may impact Rrs within the waterbody, leading to outliers - we will need to be cautious applying the algo when snow is on the ground!
  • high-aerosol pixels must be masked, as high aerosol results in green-hued areas of scenes that will get labeled as something else entirely
  • we may need to aggregate the sediment labels into a single class for reliable results
  • pixels with high-confidence QA flags should be dismissed from model application

write_rds(LS8_training_labels, paste0("data/labels/LS8_labels_for_tvt_", outlier_version, ".RDS"))
Shoreline Contamination

This script creates a quick visualization of user-labeled shoreline contamination (SC) and creates a new shapefile that has been edited to remove areas where SC has been detected so that we can add some uncertainty measures later in the model.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

Pull in necessary files

Load the labels file into the environment

labels <- read_csv("data/labels/collated_label_data_v2023-07-20.csv")
## Rows: 7862 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): mission, class, vol_init
## dbl  (11): lat, lon, Red, Green, Blue, B5, B6, B7, B8, B11, B12
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Load the aoi into the environment

aoi <- st_read('data/aoi/Superior_AOI_modeling.shp')
## Reading layer `Superior_AOI_modeling' from data source 
##   `/Users/steeleb/Documents/GitHub/Superior-Plume-Bloom/data/aoi/Superior_AOI_modeling.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -92.09412 ymin: 46.56115 xmax: -90.1 ymax: 47.30476
## Geodetic CRS:  NAD83

Grab the shoreline contamination labels. In this case we don’t actually care whether there is any per-mission mismatch (as we did in the 00 and 01 scripts), since we are just using the location identity (not the band data) to create a mask of uncertainty near the shore, where there may be shoreline contamination within the AOI.

sc_labels <- labels %>%
  filter(class == 'shorelineContamination')

The CRS is WGS 84, so let’s save these as an sf object

sc <- st_as_sf(sc_labels, coords = c('lon', 'lat'), crs = 'EPSG:4326')

#and then convert to the crs of the aoi
aoi_crs <- st_crs(aoi)

sc <- st_transform(sc, aoi_crs)

View the data and clip the AOI

And plot to see where they are in conjunction with the shapefile

ggplot() +
  geom_sf(data = aoi) +
  geom_sf(data = sc, color = 'red') +
  theme_minimal()

As expected, this has pretty good coverage, especially given the number of labels.

Let’s buffer these points by 60m and snip the polygon to account for these areas of contamination.

sc_buff <- st_buffer(sc, dist = 60)
# then perform a union so that there's only one feature
sc_buff_1 <- st_union(sc_buff)
# check to make sure
mapview(aoi) +
  mapview(sc_buff_1)

And now remove those areas from the polygon

# make geometries valid
aoi <- st_make_valid(aoi)
sc_buff_1 <- st_make_valid(sc_buff_1)

# remove areas of sc_buff from original aoi
aoi_no_sc <- st_difference(aoi, sc_buff_1)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
mapview(aoi_no_sc)

And let’s remove the tiny areas that are created near the shore that are not connected to the large polygon.

# get indiv polygons
aoi_no_sc_all <- st_cast(aoi_no_sc, 'POLYGON')
## Warning in st_cast.sf(aoi_no_sc, "POLYGON"): repeating attributes for all
## sub-geometries for which they may not be constant
# calculate the area for each
aoi_no_sc_all$area <- st_area(aoi_no_sc_all)
# sort them by size and only grab the largest one
aoi_no_sc_one <- aoi_no_sc_all %>%
  arrange(-area) %>%
  slice(1) %>%
  st_make_valid(.)

mapview(aoi_no_sc_one)

And now trim the SC layer to the modeling AOI.

sc_buff_export <- sc_buff_1 %>%
  st_intersection(., aoi) %>%
  st_make_valid(.)

mapview(sc_buff_export)

Save the files

And finally, write the shapefiles

st_write(sc_buff_export, 'data/aoi/Superior_shoreline_contamination.shp', append = F)
## Deleting layer `Superior_shoreline_contamination' using driver `ESRI Shapefile'
## Writing layer `Superior_shoreline_contamination' to data source 
##   `data/aoi/Superior_shoreline_contamination.shp' using driver `ESRI Shapefile'
## Writing 1 features with 0 fields and geometry type Multi Polygon.
st_write(aoi_no_sc, 'data/aoi/Superior_AOI_minus_shoreline_contamination.shp', append = F)
## Deleting layer `Superior_AOI_minus_shoreline_contamination' using driver `ESRI Shapefile'
## Writing layer `Superior_AOI_minus_shoreline_contamination' to data source 
##   `data/aoi/Superior_AOI_minus_shoreline_contamination.shp' using driver `ESRI Shapefile'
## Writing 1 features with 1 fields and geometry type Multi Polygon.